Airbnb Paris dataset analysis

Datacamp 2021

Alexandre PERBET
Cyril NERIN
Hugo RIALAN
Paul ORLUC
Walid CHRIMNI
Zakaria BEKKR



INTRODUCTION

Airbnb is a community platform that connects travelers with hotel companies, rental property investors, and individuals who rent out all or part of their home. The site provides a search and booking service between hosts offering accommodation and guests looking to rent. It lists more than 1.5 million rentals in over 34,000 cities and 191 countries. In this study, we restrict ourselves to the city of Paris.

We will use Machine Learning algorithms to predict the price of an Airbnb rental in Paris.

The use case of our work would be :

Data source : http://insideairbnb.com/get-the-data.html

Libraries

Utility functions

Load Data

Exploratory data analysis

The goal of this part is to perform an exploratory data analysis of the dataset to get a good overview of it and gain information.

As some parts require accurate information, we also do some light preprocessing to make the exploratory analysis as reliable as possible.

Data dictionary

The data from df_main

We notice that the attribute "neighbourhood_group" is not filled in

Location of the apartments for rent

Textual variables

column "name"

We can see that the names of the apartments listed in our dataset are skewed towards upper-class Paris areas. We can also see that adjectives describing apartment quality are especially popular, such as cosy, bright, and charming.

Column "neighbourhood"

The data from df_listings

Construction of the dataset that will be used for the analysis

Cleaning the data set

Dealing with Missing Values

Columns deletion

Neighbourhood_group

We note that neighbourhood_group has only one unique value, which is NaN. The column is therefore never filled in, so we delete it.
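
As a minimal sketch of this check, assuming a pandas DataFrame: the toy table below is invented for illustration and only mimics the situation described (a neighbourhood_group column holding nothing but NaN). It is not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; the values are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group": [np.nan, np.nan, np.nan],
    "price": [80.0, 120.0, 65.0],
})

# With dropna=False, NaN counts as a value, so a column that holds
# only NaN reports a single unique value.
n_unique = df["neighbourhood_group"].nunique(dropna=False)

# Collect every column that is entirely missing and drop it.
empty_cols = [col for col in df.columns if df[col].isna().all()]
df_clean = df.drop(columns=empty_cols)

print(n_unique, empty_cols, list(df_clean.columns))
```

Detecting fully empty columns programmatically, rather than dropping neighbourhood_group by name, keeps the step valid if another export of the data has a different set of empty columns.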

License

Moreover, more than half of the values in license are missing. As the column carries little meaning, we chose to delete it.

last_review

In addition, no relevant default date can be specified for last_review, so we chose to delete this column as well.

Replacement of missing data with default values

As we can see in the following output, in our new, cleaned data set, we no longer have any missing data:

Outliers processing

We define an outlier as a value that lies more than 3 standard deviations from the mean.

Two features are not displayed in the table above: "accommodates" and "bedrooms"

The variable 'price' (target variable)

WARNING: We observe abnormally high prices for the categories "Entire_home/Apt" and "private_room". These outliers on the target variable prevent us from having a good Machine Learning model. We need to remove them.

NOTE: Outliers affect the mean and the standard deviation. Because our outlier selection algorithm relies on these two quantities and we perform only one selection step, some outliers will remain after our treatment, but the maximum price obtained will not penalize future processing.

NOTE: The average price and the distribution of prices around this average value are very different according to the variable room_type
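
The 3-standard-deviation rule and the single selection step described above could be sketched as follows. This is an illustration on synthetic prices, not the notebook's actual code: the helper name remove_outliers_once and the generated data are assumptions.

```python
import numpy as np
import pandas as pd


def remove_outliers_once(df, column, n_std=3.0):
    """Keep rows whose value lies within n_std standard deviations of the mean.

    The mean and standard deviation are computed once, on the unfiltered data,
    so extreme values inflate the threshold used to detect them.
    """
    mean = df[column].mean()
    std = df[column].std()
    keep = (df[column] - mean).abs() <= n_std * std
    return df[keep]


# Synthetic prices: 1000 ordinary listings plus two extreme ones.
rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(100, 20, size=1000), [5000.0, 9000.0]])
df = pd.DataFrame({"price": prices})

filtered = remove_outliers_once(df, "price")

# Upper threshold before and after the single pass: it tightens once the
# extreme values are gone, which is why a second pass would flag more rows.
limit_before = df["price"].mean() + 3 * df["price"].std()
limit_after = filtered["price"].mean() + 3 * filtered["price"].std()
print(len(df), len(filtered), limit_before, limit_after)
```

Because the threshold is computed from data that still contains the extreme prices, it is much looser than the one obtained after filtering. This is the effect the first note describes.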

Univariate Analysis

Profiling the dataset df

Target variable description

As expected, the most expensive listings are those where the entire home is available. We have little data on the last two categories (shared room and hotel room), so we cannot say much about them.

This plot shows prices by neighbourhood. Location has a significant impact on price: for example, a room in Gobelins is cheaper than a room in Luxembourg.

Correlation of 'price' with other variables

The coefficient of determination (squared correlation) is calculated to ignore the sign of the value
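
A minimal sketch of this computation, assuming the Pearson correlation from pandas is the one being squared. The data below is synthetic and the column values are invented; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on accommodates, not on minimum_nights.
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 7, size=n)
df = pd.DataFrame({
    "accommodates": accommodates,
    "minimum_nights": rng.integers(1, 30, size=n),
    "price": 40 * accommodates + rng.normal(0, 15, size=n),
})

# Pearson correlation matrix, then squared element-wise so that
# negative and positive correlations are ranked on the same scale.
r = df.corr()
r2 = r ** 2

# Variables ranked by their squared correlation with the target.
price_r2 = r2["price"].drop("price").sort_values(ascending=False)
print(price_r2)
```

For a pair of variables, the squared Pearson correlation equals the coefficient of determination of a simple linear regression of one on the other, which is why the two terms can be used interchangeably here.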

We note that the variable 'price' is mainly correlated to the variables 'accommodates', 'bedrooms', 'beds' and 'availability_*'. We will therefore reduce the selected variables to improve the readability of the matrix of coefficients of determination.

Feature Engineering

Preprocessing

Train and test data

We keep the proportions of each type of rooms in the train and test datasets
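
One common way to preserve these proportions is the stratify argument of scikit-learn's train_test_split. The sketch below assumes that approach; the toy table, the 80/20 split size, and the random_state are illustrative choices, not values taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table: 80% entire homes, 20% private rooms (invented for illustration).
df = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 80 + ["Private room"] * 20,
    "price": range(100),
})

# stratify makes each split reproduce the room_type proportions of df.
train, test = train_test_split(
    df,
    test_size=0.2,
    stratify=df["room_type"],
    random_state=42,
)

print(train["room_type"].value_counts(normalize=True))
print(test["room_type"].value_counts(normalize=True))
```

Without stratification, a rare category such as shared rooms could end up almost entirely in one of the two sets, which would make the evaluation on that category unreliable.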

Evaluation of regression models

Selection of best models

Cross validation

Model optimisation

Visualize prediction

Explainability

Interpretability is defined as the ability for a human to understand the reasons for a model’s decision. This criterion has become preponderant for many reasons:

Importance of variables extracted directly from the Model

Conclusion

Summary of the results obtained

Convert notebook to HTML